Tagging a Corpus of Interpreted Speeches: the European Parliament Interpreting Corpus (EPIC)

Authors

  • Annalisa Sandrelli
  • Claudio Bendazzoli
Abstract

The performance of three different taggers (Treetagger, Freeling and GRAMPAL) is evaluated on three different languages, i.e. English, Italian and Spanish. The materials are transcripts from the European Parliament Interpreting Corpus (EPIC), a corpus of original (source) and simultaneously interpreted (target) speeches. Owing to the oral nature of our materials and to the specific characteristics of spoken language produced in simultaneous interpreting, the chosen taggers have to deal with non-standard word order, disfluencies and other features not to be found in written language. Parts of the tagged sub-corpora were automatically extracted in order to assess the success rate achieved in tagging and lemmatisation. Errors and problems are discussed for each tagger, and conclusions are drawn regarding future developments.

Similar resources

ECPC: el discurso parlamentario europeo desde la perspectiva de los estudios traductológicos de corpus

This paper presents the main outcome of the ECPC research group: an archive of European parliamentary speeches created to study this genre and the hypothetical influence of translation in the construction of European identity. The archive is made up of, on the one hand, a parallel corpus containing the English and Spanish versions of the European Parliament proceedings, and on the other hand, t...

Exploring the Political Agenda of the European Parliament Using a Dynamic Topic Modeling Approach

This study analyzes the political agenda of the European Parliament (EP) plenary, how it has evolved over time, and the manner in which Members of the European Parliament (MEPs) have reacted to external and internal stimuli when making plenary speeches. To unveil the plenary agenda and detect latent themes in legislative speeches over time, MEP speech content is analyzed using a new dynamic top...

Cleaning the Europarl Corpus for Linguistic Applications

We discovered several recurring errors in the current version of the Europarl Corpus originating both from the web site of the European Parliament and the corpus compilation based thereon. The most frequent error was incompletely extracted metadata leaving non-textual fragments within the textual parts of the corpus files. This is, on average, the case for every second speaker change. We not onl...

A Critical Study of Selected Political Elites' Discourse in English

This study explored how political elites can contribute to power enactment through using language. It started with a theoretical overview of Critical Discourse Analysis (CDA), and then presented a corpus consisting of speeches of eight political elites, namely, Malcolm X, Noam Chomsky, Martin Luther King, Josef Stalin, Vladimir Lenin, Winston Churchill, J.F. Kennedy and Adolf Hitler. This stud...

Building an ASR Corpus Using Althingi's Parliamentary Speeches

Acoustic data acquisition for under-resourced languages is an important and challenging task. In the Icelandic parliament, Althingi, all performed speeches are transcribed manually and published as text on Althingi’s web page. To reduce the manual work involved, an automatic speech recognition system is being developed for Althingi. In this paper the development of a speech corpus suitable for ...

Publication date: 2006